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The data augmentation (DA) algorithm is a widely used Markov 
chain Monte Carlo algorithm that is easy to implement but often suf- 
fers from slow convergence. The sandwich algorithm is an alternative 
that can converge much faster while requiring roughly the same com- 
putational effort per iteration. Theoretically, the sandwich algorithm 
always converges at least as fast as the corresponding DA algorithm 
in the sense that \\K*\\ < \\K\\, where K and K* are the Markov 
operators associated with the DA and sandwich algorithms, respec- 
tively, and || • || denotes operator norm. In this paper, a substantial 
refinement of this operator norm inequality is developed. In partic- 
ular, under regularity conditions implying that if is a trace-class 
operator, it is shown that K* is also a positive, trace-class operator, 
and that the spectrum of K* dominates that of K in the sense that 
the ordered elements of the former are all less than or equal to the 
corresponding elements of the latter. Furthermore, if the sandwich 
algorithm is constructed using a group action, as described by Liu 
and Wu [J. Amer. Statist. Assoc. 94 (1999) 1264-1274] and Hobert 
and Marchev [Ann. Statist. 36 (2008) 532-554], then there is strict 
inequality between at least one pair of eigenvalues. These results are 
applied to a new DA algorithm for Bayesian quantile regression intro- 
duced by Kozumi and Kobayashi [J. Stat. Comput. Simul. 81 (2011) 
1565-1578]. 



1. Introduction. Suppose that /x ; X— > [0, oo) is an intractable proba- 
bility density that we would like to explore. Consider a data augmentation 
(DA) algorithm [Tanner and Wong (1987), Liu, Wong and Kong (1994)] 
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based on the joint density / : X x Y — > [0, oo), which of course must satisfy 



We are assuming here that X and Y are two sets equipped with countably 
generated cr-algebras, and that f(x, y) is a density with respect to fj, x v. The 
Markov chain underlying the DA algorithm, which we denote by {X n }^ =0 , 
has Markov transition density (Mtd) given by 



In other words, k(-\x) is the density of X n+ i, given that X n = x. It is well 
known and easy to see that the product k(x'\x)fx(x) is symmetric in (x,x'), 
that is, the DA Markov chain is reversible with respect to fx- (We assume 
throughout that all Markov chains on the target space, X, are Harris er- 
godic, that is, irreducible, aperiodic and Harris recurrent.) Of course, the 
DA Markov chain can be simulated by drawing alternately from the two con- 
ditional densities defined by f(x, y). If the current state is X n = x, then X n+ i 
is simulated in two steps: draw Y ~ fY\x('\ x ), call the result y, and then 
draw X n+ i ~ fx\Y{-\y)- 

Like its cousin the EM algorithm, the DA algorithm can be very slow to 
converge. A powerful method for speeding up the DA algorithm was discov- 
ered independently by Liu and Wu (1999) and Meng and van Dyk (1999). 
The basic idea behind the method (called "PX-DA" by Liu and Wu and 
"marginal augmentation" by Meng and van Dyk) is to introduce a (low- 
dimensional) parameter into the joint density f(x,y) that is not identifiable 
in the target, fx- This allows for the construction of an entire class of viable 
DA algorithms, some of which may converge much faster than the original. 
Here is a brief description of the method in the context where X and Y are 
Euclidean spaces, and f(x,y) is a density with respect to the Lebesgue mea- 
sure. Suppose that for each g in some set G, there is a function t g : Y — > Y that 
is one-to-one and differentiable. Consider a parametric family of densities 
(indexed by g) given by f(x,y;g) = f(x,t g (y))\J g (y)\, where J g (z) is the Ja- 
cobian of the transformation z = t~ 1 (y). Note that J* Y f(x, y; g) dy = fx(x), 
so g is not identifiable in fx- Now fix a "working prior" density on g, call 
it r(g), and define a joint density on X x Y as follows: 



Clearly, the x-marginal of f r (x, y) is the target, fx - Thus, each working prior 
leads to a new DA algorithm that is potentially better than the original one 
based on f(x,y). Liu and Wu (1999), Meng and van Dyk (1999) and van 
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Dyk and Meng (2001) find the working priors that lead to particularly fast 
algorithms. 

Of course, one iteration of the DA algorithm based on f r (x,y) could be 
simulated using the usual two-step method (described above) which entails 
drawing from the two conditional densities defined by f r (x,y). However, 
Liu and Wu (1999) showed that this simulation can also be accomplished 
using a three-step procedure in which the first and third steps are draws 
from fy\x('\ x ) an d fx\Y{'\v)i respectively, and the middle step involves 
a single move according to a Markov chain on the space Y that has invari- 
ant density fy(y)- In this paper, we study a generalization of Liu and Wu's 
(1999) three-step procedure that was introduced by Hobert and Marchev 
(2008) and is now described. 

Suppose that R{y,dy') is any Markov transition function (Mtf) on Y 
that is reversible with respect to fy(y)v(dy), that is, R(y,dy')fy(y)i'(dy) = 
R(y',dy) x fy(y')i'(dy'). Consider a new Mtd given by 

(1) fcVl*) = J J fx\ Y W)R(y, dy')f Y \ x (y\x)v(dy). 

It's easy to see that k*(x'\x)fx(x) is symmetric in (x,x r ), so the Markov 
chain defined by k* , which we denote by {X*}^ =0 , is reversible with respect 
to fx- If the current state of the new chain is X* = x, then X* +1 can be 
simulated using the following three-steps, which are suggested by the form 
of k*. Draw Y ~ fy\x('\ x )i cau the result y, then draw Y' ~ R(y,-), call 
the result y' , and finally draw X* +1 ~ fxy/i'W)- Again, the first and third 
steps are exactly the two steps used to simulate the original DA algorithm. 
Because the draw from R(y, •) is "sandwiched" between the draws from the 
two conditional densities, Yu and Meng (2011) call this new algorithm the 
"sandwich algorithm" and we will follow their lead. The PX-DA/marginal 
augmentation method can be viewed as one particular recipe for construct- 
ing R(y,dy'). Another general method for building R is described in Sec- 
tion 4. 

It is often possible to construct a sandwich algorithm that converges much 
faster than the underlying DA algorithm while requiring roughly the same 
computational effort per iteration. Examples can be found in Liu and Wu 
(1999), Meng and van Dyk (1999), van Dyk and Meng (2001), Marchev 
and Hobert (2004) and Hobert, Roy and Robert (2011). What makes this 
"free lunch" possible is the somewhat surprising fact that a low-dimensional 
perturbation on the Y space can lead to a major improvement in mixing. In 
fact, the chain driven by R is typically reducible, living in a small subspace 
of Y that is determined by its starting value. Drawing from such an R is 
usually much less expensive computationally than drawing from fy\x{'\ x ) 
and fx\v{-\y)- 
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Empirical studies pointing to the superiority of the sandwich algorithm 
abound. Unfortunately, the development of confirmatory theoretical results 
has been slow. It is known that the sandwich chain always converges at 
least as fast as the DA chain in the operator norm sense. Indeed, Hobert 
and Roman (2011) show that Yu and Meng's (2011) Theorem 1 can be used 
to show that 



where K, K* and R denote the usual Markov operators defined by k, k* 
and R(y,dy'), respectively, and || • || denotes the operator norm. (See Sec- 
tion 2 for more details as well as references.) Of course, we would like to 
be able to say that ||-K"*|| is strictly smaller than and this is certainly 

the case when ||i?|| < 1. However, the i?'s used in practice typically have 
norm 1 (because the corresponding chains are reducible). In fact, in most 
applications, R is reducible and idempotent, that is, J* Y R{u-> dy")R(y" , dy') = 



Hobert, Roy and Robert (2011) provided a refinement of (2) for the case in 
which Y is finite and R is reducible and idempotent. These authors showed 
that, in this case, K and K* both have pure eigenvalue spectra that are 
subsets of [0,1), and that at most m — 1 of the eigenvalues are nonzero, 
where |Y| = m < oo. They also showed that the spectrum of K* dominates 
that of K in the sense that < A* < A,; < 1 for all i, where Aj and A* denote 
the ith. largest eigenvalues of K and K* , respectively. Note that taking i = 1 
yields \\K*\\ = A| < Ai = pir||. 

In this paper we develop results that hold in the far more common situ- 
ation where |Y| =oo. First, we generalize Hobert, Roy and Robert's (2011) 
result by showing that the assumption that Y is finite can be replaced by 
the substantially weaker assumption that J x k(x\x)fj,(dx) < oo. In this more 
general case, K and K* still have pure eigenvalue spectra that are subsets of 
[0, 1) and an analogous domination holds, but the number of nonzero eigen- 
values is no longer necessarily finite. Second, we show that if R is constructed 
using a group action, as described by Liu and Wu (1999) and Hobert and 
Marchev (2008), then the domination is strict in the sense that there ex- 
ists at least one i such that < A* < Aj < 1. Finally, we apply our results 
to a new DA algorithm for Bayesian quantile regression that was recently 
introduced by Kozumi and Kobayashi (2011). 

The remainder of this paper is organized as follows. Section 2 contains 
a brief review of the relationship between the spectral properties of Markov 
operators and the convergence properties of the corresponding Markov chains. 
The DA and sandwich algorithms are formally defined and compared in Sec- 
tion 3. The construction of R using group actions is discussed in Section 4, 
and our analysis of Kozumi and Kobayashi's DA algorithm is described in 
Section 5. 



(2) 




R(y,dy'). 



SPECTRAL ANALYSIS OF DATA AUGMENTATION 



5 



2. Brief review of self-adjoint Markov operators. Let P(x, dx') be a gene- 
ric Mtf on X that is reversible with respect to fx(x)fJ,(dx). Denote the 
Markov chain driven by P as $ = {&n}%Lo- (Note that $ is not neces- 
sarily a DA Markov chain.) The convergence properties of <I> can be ex- 
pressed in terms of a related operator that is now defined. Let Lq(/x) 
be the space of real-valued functions with domain X that are square in- 
tegrable and have mean zero with respect to fx- In other words, g £ Lq(/x) 
if J x g p (x)fx(x)n(dx) is finite when p = 2, and vanishes when p = 1. This is 
a Hilbert space where the inner product of g, h £ L^{fx) is defined as 

(9,h)= / g(x)h(x)f x (x)n(dx), 
Jx 

and the corresponding norm is, of course, given by = (g,g) l l 2 . Let P: 
Lq(Jx) — > Lq(Jx) denote the operator that maps g £ Lq(Jx) to 

(Pg)(x)= ( g(x')P(x,dx'). 
Jx 

Note that (Pg)(x) is simply the conditional expectation of g($> n +i) given 
that $ n = x. Reversibility of the Mtf P(x, dx') is equivalent to the operator P 
being self-adjoint. The (operator) norm of P is defined as 

||p|| = sup H-Pflil, 

where L^i(fx) in the subset of Lq(Jx) that contains the functions g sat- 
isfying J x g 2 (x)fx(x)fJ.(dx) = 1. It's easy to see that ||P|| £ [0,1]. Roberts 
and Rosenthal (1997) show that ||P|| < 1 if and only if $ is geometrically er- 
godic. Moreover, in the geometrically ergodic case, ||P|| can be viewed as the 
asymptotic rate of convergence of $ [see, e.g., Rosenthal (2003), page 170]. 

If P satisfies additional regularity conditions, much more can be said 
about the convergence of the corresponding Markov chain. Assume that the 
operator P is compact and positive, and let jet;} denote its eigenvalues, 
all of which reside in [0,1). The number of eigenvalues may be finite or 
countably infinite (depending on the cardinality of X), but in either case 
there is a largest one and it is equal to ||P||. [For a basic introduction to the 
spectral properties of Markov operators, see Hobert, Roy and Robert (2011).] 
If tr(P) := X^i a « < °°> then P is a trace-class operator [see, e.g., Conway 
(1990), page 267]. As explained in Diaconis, Khare and Saloff-Coste (2008), 
when P is positive and trace-class, the chain's x 2_ distance to stationarity 
can be written explicitly as 

(3) I lpn{ ^-J^ )l \ (<y) = £ atefix), 

where p n (-\x) denotes the density of <£ n given that $0 = x, and {e^} is 
an orthonormal basis of eigen-functions corresponding to {on}. Of course, 
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the x 2 -distance serves as an upper bound on the total variation distance. 
Assume that the eigenvalues are ordered so that 014 > ctj+i, and let i* = 
max{i G N : aj = ai}. Asymptotically, the term a 2n (e 2 (x) H — ■ + e 2 *{x)) will 
dominate the sum on the right-hand side of (3). Hence, in this context it is 
certainly reasonable to call ||P|| = a\ the "asymptotic rate of convergence." 
Our focus in this paper will be on DA algorithms whose Markov operators 
are trace-class. 

3. Spectral comparison of the DA and sandwich algorithms. As in Sec- 
tion 1, let K:L 2 (f x )^ L 2 (f x ), K* : L 2 (f x ) ^ L 2 (f x ) and R:L 2 (f Y )^ 
Lo(fy) denote the (self-adjoint) Markov operators defined by the DA chain, 
the sandwich chain and R(y,dy'), respectively. We will exploit the fact 
that K and K* can be represented as products of simpler operators. In- 
deed, let P x : L 2 (f Y ) -> L 2 (f x ) map h 6 L 2 (f Y ) to 

(P x h)(x) = J h(y)f Ylx (y\x)u(dy) 

and, analogously, let P Y : L%(f x ) ->■ L%(f Y ) map g £ L^fx) to 

(Py9)(y)= / 9{x)f x \ Y (x\y)^(dx). 
Jx 

It is easy to see that K = PxPy and K* = PxRP Y - This representation 
of K was used in Diaconis, Khare and Saloff-Coste (2008). 

Again, as in Section 1, let /:X x Y— > [0, 00) be the joint density that 
defines the DA Markov chain. Throughout the next two sections, we assume 
that / satisfies the following condition: 

(4) / / f x \ Y (x\y)f Y \x(y\x)v(dy)fx(dx) < 00. 

Jx Jy 

Buja (1990) shows that (4) implies that K is a trace-class operator. It is clear 
that (4) holds if X and/or Y has a finite number of elements. However, (4) 
can also hold in situations where |X| = |Y| = 00. Indeed, in Section 5 we 
establish that (4) holds for a DA algorithm for Bayesian quantile regression 
where X and Y are both uncountable. On the other hand, (4) certainly does 
not hold for all DA algorithms. For example, (4) cannot hold if the DA 
chain is not geometrically ergodic (because subgeometric chains cannot be 
trace-class). Simple examples of subgeometric DA chains can be found in 
Papaspiliopoulos and Roberts (2008) and Tan (2008), Chapter 4. 

Condition (4) has appeared in the Markov chain Monte Carlo literature 
before. It is exactly the bivariate version of Liu, Wong and Kong's (1995) 
"Condition (b)" and it was also employed by Schervish and Carlin (1992). 
Unfortunately, there does not appear to be any simple, intuitive interpreta- 
tion of (4) in terms of the joint density f(x,y) or the corresponding Markov 
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chain. In fact, referring to their Condition (b), Liu, Wong and Kong (1995) 
state that "It is standard but not easy to check and understand." 

Our analysis of the DA and sandwich algorithms rests heavily upon a use- 
ful singular value decomposition of f(x,y) whose existence is implied by (4). 
In particular, Buja (1990) shows that if (4) holds, then 



• A) = 1, 9o = 1, h = 1. 

• {gi}°Z and {hi}°^L form orthonormal bases of L 2 (fx) and L 2 (/y), re- 
spectively. 

• ^ G [0, 1], and ^ < ft-i for all % G N. 

• Jx Jy 9i{x)hj(y)f{x,y)iy(dy)n(dx) =0]£i^j. 

A few remarks about notation are in order. First, we state all results for 
the case |X| = |Y| = oo, and leave it to the reader to make the obvious, minor 
modifications that are required when one or both of the spaces are finite. 
For example, in the singular value decomposition above, if one or both of the 
spaces are finite, then one or both of the orthonormal bases would have only 
a finite number of elements, etc. Second, we will let (•, •) and || • || do double 
duty as inner product and norm on both Lq(J'x) and Lp(/y). However, the 
norms of operators whose domains and ranges differ will be subscripted. 
The following result can be gleaned from calculations in Buja (1990), but 
we present a proof in Appendix A for completeness. 

Lemma 1. Assume that (4) holds and let X\ > A2 > A3 > • • • denote the 
eigenvalues of K, which reside in the set [0, 1). For each i G N, Pxh% = fiigi 
and Pygi = Pih%. Moreover, 



Here is the first of our two main results. 

Theorem 1. Assume that (4) holds and that R is idempotent with 
\\R\\ = 1. Define Z = max{z G N:/% = fix} and N = {i G N:ft > 0}. Then: 

(1) K* is a positive, trace- class operator. 

(2) A* < Xi for all i G N, where {A i }^ 1 and {A*}~ 1 denote the (ordered) 
eigenvalues of K and K* , respectively. 

(3) A* = \i for all i G N if and only if Rhi = hi for every i G N . 




where: 



p x\\liuy)^liu x ) - \\ p Y\\q(f x )^q(f Y )-Pi 



and Xi = 01 . 
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(4) A necessary and sufficient condition for \\K*\\ < \\K\\ is that the only 
a = (01, . . . ,ai) £M. 1 for which 

l l 

(6) R^ajhj = y^ j a i h i 

i=i i=i 

is the zero vector in M. 1 . 

Remark 1. Part (3) can be rephrased as follows: tr(K*) = ti(K) if and 
only if Rh{ = hi for every i G N. Also, note that \\K*\\ = A| and \\K\\ = X\. 

PROOF of Theorem 1. We begin by noting that for g G L^(fx) and 
h G Ll(f Y ), we have {Pxh,g) = {h,P Y g). Hence, 

(K*g,g) = (P x RP Y g,g) = (RP Y g,P Y g) = (R^Pyg^^Pyg) > 0, 

which shows that K* is positive. Since K is trace-class, it follows from Lem- 
ma 1 that tr(-RT) = Y^S=i Pi < 00 • Now, since {gi}°Zi is an orthonormal basis 
for Lq(/x), we have 

oo oo oo 

tr(iT) = J2( K *9i,9i) = J2( P x RP r9i,9i) = J^^^Pvgi) 

i=l i=l i=l 

oo oo 

= Y,^(Rh t ,h l ) <J2$ = tr W' 

i=l i=l 

where the inequality follows from the fact that (Rhi,hi) < 1. Thus, K* is 
trace-class. Moreover, it is clear that tr(K*) = tr(K) if and only if {Rhi,hi) = 
1 whenever /3j > 0. Since R is idempotent with norm 1, it is a projection [Con- 
way (1990), page 37]. Thus, for any h £ Lq ( f Y ) , (Rh, h) = (h, h)^Rh = h. 
[Indeed, (h, h) = (Rh, h) + ((I- R)h, h) , so (Rh, h) = (h, h)=>((I- R)h, h) = 
((I - R)h, (I - R)h) = 0.] Consequently, tr(K*) = tr(K) if and only if Rhi = 
hi for every i such that > 0. This takes care of (3). 
Now, note that K — K* = Px(I — R)Py is positive since 

(P X (I - R)P Y g,g) = ((I - R)P Y g,P Y g) = ((I - R)P Y g, (I - R)P Y g) > 0. 

Therefore, for any nonnull g G Lq(/x), we have 

(K*g,g) < (Kg,g) 

(g,g) ~ (g,g) 

Now, for any i G N, the Courant-Fischer-Weyl minmax characterization of 
eigenvalues of compact, positive, self-adjoint operators [see, e.g., Voss (2003)] 
yields 

x * • (K*g,g) . . (Kg,g) 

Aj = mm max — : — < mm max — — = Aj, 

dim(V)=i-lg&V^,g^0 (g,g) dim(V)=i-l g€V±,g^0 (g,g) 
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where V denotes a subspace of Lq(Jx), and dim(V) is its dimension. This 
proves (2). 

All that remains is (4). Assume there exists a nonzero a such that (6) 
holds. We will show that \\K*\\ = \\K\\. Since we know that \\K*\\ < \\K\\ = 
j3\, it suffices to identify a function g S L^(fx) such that = /3^||</||. If 

we take g = a\g\ + • • • + aigi, then 



\\K*g\\ = 
But /3i = • • • = 

\\K*g\\=/3i 



K* ^2 a i9i 
i=i 

Pi, and, hence, 
I 

i=l 



P X RP Y J2 a i9i 



i=l 



i=i 



Pi 



Px ^2 aiht 



i=l 



i=i 



ai9i 



The second half of the proof is by contradiction. Assume that the only agtf 
for which (6) holds is the zero vector, and assume also that \\K*\\ = \\K\\ = 
P\. By completeness of the Hilbert space, L^(fx), there exists a nontrivial 
function g € Lq(/x) such that = The rest of the argument 

differs depending upon whether g is in the span of {g±, . . . ,gi} or not. 

Case I: Assume that g = Yli=i a i9i f° r some nonzero a € M. 1 . Using the 
results above, we have 



\K*g\\ = \\P X RPy9\\ < Pi\\RPy9\\ = P\ 



Ry ajhj 



i=i 



But R is a projection, so Rh ^ h =>■ \\Rh\\ 7^ \\h\\. Hence, R^\ =1 aihi 7^ 
Y^i=i a ihi in conjunction with ||J?|| = 1 yields 



Ry~]aihi 



i=i 



< 



i=i 



\9\ 



Thus, 1 1 < which is a contradiction. 

Case II: Assume that g is not in the span of {g±, . . . ,g{\. In other words, 
9 = Yli=i bi9i where at least one term in the sequence {bi + i,bi +2 , ■ ■ ■} is 
nonzero. Then, 



\\Py9\\ 
It follows that 



00 

Py ^2 bi 9i 
i=i 



1=1 



< 



1=1 



N 



i=l 



\\K*g\\ < \\Px\\q UY) ^ L 2 {fx) \\R\\\\PY9\\ <P 2 i\\g\\, 
and, again, this is a contradiction. □ 
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4. Using a group action to construct R. Following Liu and Wu (1999) 
and Liu and Sabatti (2000), Hobert and Marchev (2008) introduced and 
studied a general method for constructing practically useful versions of R(y, 
dy') using group actions. For the remainder of this section, assume that X 
and Y are locally compact, separable metric spaces equipped with their Borel 
u-algebras. Suppose that G is a third locally compact, separable metric space 
that is also a topological group. As usual, let e denote the identity element of 
the group. Also, let R+ = (0, oo). Any continuous function \ '■ G — > R+ such 
that x{gi92) = x{9i)x{92) for all 51,52 6 G is called a multiplier [Eaton 
(1989)]. Clearly, a multiplier must satisfy x( e ) = 1 an d x{g~ l ) = VxG?)- 
One important multiplier is the modular function, A, which relates the left- 
Haar and right-Haar measures on G. Indeed, if we denote these measures 
by and uj r (-), then uj r (dg) = A(g~ 1 )uji(dg). Groups for which A = 1 are 
called unimodular groups. 

An example (that will be used later in Section 5) is the multiplicative 
group, R+, where group composition is defined as multiplication, the identity 
element is e = 1 and g _1 = 1/g. This group is unimodular with Haar measure 
given by u)(dg) = dg/g where dg denotes the Lebesgue measure on R + . 

Let F:GxY->Y be a continuous function satisfying F(e, y) = y and 
F(9i92,y) = F(gi,F(g 2 ,y)) for all gi,g 2 G G and all y £ Y. The function F 
represents G acting topologically on the left of Y and, as is typical, we ab- 
breviate F(g,y) with gy. Now suppose there exists a multiplier, such that 



for all g € G and all integrable <j> : Y — > R. Then the measure v is called rela- 
tively (left) invariant with multiplier \. For example, suppose that Y = R m , 
v(dy) is the Lebesgue measure, G is the multiplicative group described 
above, and the group action is defined to be scalar multiplication, that is, 
gy = (gyi, gy 2 , ■ ■ ■ , gy m )- Then u(dy) is relatively invariant with multiplier 
X(g) =g m - Indeed, 



We now explain how the group action is used to construct R. Define 



Assume that m(y) is positive for all y G Y and finite for i/-almost all y £ Y. 
For the remainder of this section, we assume that R'.L^fy) — > Lg(/y) is 
the operator that maps h(y) to 



Hobert and Marchev (2008) show that R is a self-adjoint, idempotent Markov 
operator on L^/y). The corresponding Markov chain on Y evolves as fol- 
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lows. If the current state is y, then the distribution of the next state is that 
of gy, where g is a random element from G whose density is 

/7\ fv{gy)x{g) , 

(7) t-\ u l\ d 9)- 

Therefore, we can move from X* = x to X* +1 as follows: draw Y ~ fY\x{'\ x )i 
call the result y, then draw g from the density (7) and set y' = gy, and finally 

drawX* +1 ~/ X | y (-|y')- 

Hobert and Marchev (2008) also show that, if {Y n }^ = Q denotes the Markov 
chain defined by R, then conditional on Yq = y, {Y n }^ =1 are i.i.d. Thus, 
either {^j^Li are i-i-d- from fy, or the chain is reducible. 

Lemma 2. If \\R\\ = 1, then the Markov operator R is a projection onto 
the space of functions that are invariant under the group action, that is, h 
is in the range of R if and only if h(gy) = h(y) for all g € G and all y 6Y '. 

Proof. First, assume that h(gy) = h(y) for all g £ G and all y £ Y. Then 

(Bh)(y) = -L-( h(gy)f Y (gy) X (g)ui{dg) 



m{y) 

Kv) 



G 



fY(gy)x(g)vi(dg) = h{y). 



m{y) j G 

To prove the necessity, we require two results that were used repeatedly by 
Hobert and Marchev (2008). First, 

(8) X{g)m{gy) = &{g~ l )m(y). 

Second, if g £ G and (j> : G — > K is integrable with respect to oji, then 

(9) / c/>(gg- 1 )ui(dg)=A(g) [ (/>(g)ui(dg). 

JG JG 
Now, fix h £ Lg(/y) and g' £ G, and note that 

(Rh)(g'y)= [ h(gg'y)f Y {gg'y)x(g)ui(dg) 



m(g'y) 
1 



G 



X(g')m(g'y) J G 

AG/- 1 ) 



X(g')m(g'y) J G 
1 



h{gg'y)fY{gg'y)x{gg')ui{dg) 
h{gy)fY(gy)x{g)ui(dg) 



Hgy)fY(gy)x(g)ui(dg) 



m(y) j G 
= (Rh)(y), 

where the third and fourth equalities are due to (9) and (8), respectively. □ 
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We now show that when R is constructed using the group action recipe 
described above, there is at least one eigenvalue of K* that is strictly smaller 
than the corresponding eigenvalue of K. To get a strict inequality, we must 
rule out trivial cases in which the DA and sandwich algorithms are the 
same. For example, if we take G to be the subgroup of the multiplicative 
group that contains only the point {1}, then element-wise multiplication of 
y G M. m by g has no effect and the sandwich algorithm is the same as the 
DA algorithm. More generally, if 

(10) fx\y(x\y) = f x \ Y (x\gy) V<? g G,x g X,y g Y, 

then the Mtd of the sandwich chain can be expressed as 

' fr{gy)x{g) 



k*{x'\x) 



fx\Y(x'\gy) 



Y JG 



fx\Y{x'\y) 



Y JG 



m(y) 

fY(gy)x(g) 

m{y) 



-coi(dg) f Y \x{y\x)v{dy) 



ui{dg) f Y \x{y\x)v{dy) 



fx\Y(x'\y)f Y \x{y\x)v{dy) 
= k{x'\x). 

Thus, (10) implies that the DA and sandwich algorithms are exactly the same 
and, consequently, ti{K*) = ti(K). In fact, as the next result shows, (10) is 
also necessary for tr(K*) = tr(i^). 

Theorem 2. // (10) does not hold, then ti{K*) < ti(K), so at least one 
eigenvalue of K* is strictly smaller than the corresponding eigenvalue of K . 

PROOF. It is enough to show that ii(K*) = ti{K) implies (10). Recall 
that N = {i € N:/% > 0}. By Theorem 1, tr(K*) = tr(K) implies that Rhi = 
hi for every i £ N. By Lemma 2, if Rhi = hi for every i G N, then every 
member of the set {hi : i G N} is invariant under the group action. Now, using 
the singular value decomposition, we see that for every g G G,x G X,y G Y, 
we have 



fx\v(x\y) =^2f3 j g j (x)hj(y)f x (v 



j=0 



^2p j g j (x)h j (gy)f x {x) 
j=0 

fx\v{x\gy)- 



□ 



In practice, fx\y{x\y) is known exactly and it's easy to verify that (10) 
does not hold. An example is given in the next section. 
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It is important to note that, while Theorem 2 guarantees strict inequality 
between at least one pair of eigenvalues of K and K* , it does not preclude 
equality of Ai and \\. Thus, we could still have \\K\\ = We actually 

believe that one would have to be quite unlucky to end up in a situation 
where \\K\\ = To keep things simple, suppose that the largest eigen- 

value of K is unique. According to Theorem 1 and Lemma 2, \\K\\ = \\K*\\ 
if and only if hi [from (5)] is invariant under the group action. This seems 
rather unlikely given that the choice of group action is usually based on 
simplicity and convenience. This is borne out in the toy examples analyzed 
by Hobert, Roy and Robert (2011) where there is strict inequality among 
all pairs of eigenvalues. 

Recall from Section 1 that the PX-DA/marginal augmentation algorithm 
is based on a class of transformations t g : Y — > Y, for g £ G. This class can 
sometimes be used [as the function F(g, y)} to construct an R as described 
above, and when this is the case, the resulting sandwich algorithm is the 
same as the optimal limiting PX-DA/marginal augmentation algorithm [Liu 
and Wu (1999), Meng and van Dyk (1999), Hobert and Marchev (2008)]. 

5. A DA algorithm for Bayesian quantile regression. Suppose Z\, Zi, ■ ■■ , 
Z m are independent random variables such that = xj ' (3 + Ei where x\ £ W 
is a vector of known covariates associated with Zi, (3 £W is a vector of 
unknown regression coefficients, and e±, . . . ,e m are i.i.d. errors with common 
density given by 

d(e; r) = r(l-r) [e^ e J K _ (e) + e~ r£ I R+ (e)] , 

where r € (0,1). This error density, called the asymmetric Laplace density, 
has rth quantile equal to zero. Note that when r = 1 /2, it is the usual Laplace 
density with location and scale equal to and 1/2, respectively. 

If we put a flat prior on (3, then the product of the likelihood function 
and the prior is equal to r m (l — r) m s(f3,z), where 

m 

s(/3, Z) := JJ[ e (l-0fe-^/3) Jr _ {z . _ x T p) + e - r{Zl - x Jp) lR+{Zi _ 
i=l 

If s(f3,z) is normalizable, that is, if 

c(z) := / s(f3, z) dj3 < 00, 



then the posterior density is well defined (i.e., proper), intractable and given 
by 



vr o z 



c(z) • 

For the time being, we assume that the posterior is indeed proper. 
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Let U and V be independent random variables such that U ~ N(0, 1) 
and V ~ Exp(l). Also, define 9 = 9{r) = 4=^ and r 2 = r 2 (r) = ^jx- 

Routine calculations show that the random variable 9V + ry/VU has the 
asymmetric Laplace distribution with parameter r. Kozumi and Kobayashi 
(2011) exploit this representation to construct a DA algorithm as follows. 
For i = l,...,m, let (Z^Yi) be independent pairs such that ZjYi = yi ~ 
N(xjf3 + 9yi,yiT 2 ) and, marginally, 1^ ~ Exp(l). Then Z{ — xf (3 has the 
asymmetric Laplace distribution with parameter r, as in the original model. 
Combining this model with the flat prior on j3 yields the augmented posterior 
density defined as 



c'(z) 



m 1 f 1 ] 



where c'{z) = r' m (l — r) m c{z). Of course, f Rm 7r(/3,y\z) dy = ir(/3\z). This 

leads to a DA algorithm based on the joint density 7r(f3,y\z), which is vi- 
able because, as we now explain, simulation from ir(/3\y,z) and ir(y\/3,z) is 
straightforward. 

As usual, define X to be the mxp matrix whose ith row is the vector xJ. 
We assume throughout that m > p and that X has full column rank, p. 
Also, let D denote an m x m diagonal matrix whose ith diagonal element 
is (r 2 yj) _1 . A straightforward calculation shows that 

/%z~N p (/i,£), 

where E = = (X T DX)' 1 , and, letting I denote an m x 1 vector of 

ones, 

n = n(y, z ) = {x T Dxy 1 (x T Dz - ^x T i^ . 

Also, it's clear from the form of n(f3,y\z) that, given (f3,z), the g/j's are 
independent, and yi has density proportional to 



1 f 1 

11 — exp' 

y/Vi 



2r 2 



yi (2T 2 + 9 2 )+( Zi Xl(3)2 " 



in 



This is the density of the reciprocal of an inverse Gaussian random variable 
with parameters 2 + 9 2 /t 2 and \J2t 2 + 9 2 /\zi — xj(3\. Thus, one iteration 
of the DA algorithm requires one draw from a p-variate normal distribu- 
tion, and m independent inverse Gaussian draws. Note that in this example 
X = W and Y = WT:, so both spaces have uncountably many points. 

From this point forward, we restrict ourselves to the special case where 
r = 1/2, that is, to median regression. The proof of the following result, 
which is fairly nontrivial, is provided in Appendix B. 
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Proposition 1. If r = 1/2 and X has full column rank, then the joint 
density upon which Kozumi and Kobayashi's DA algorithm is based satis- 
fies (4)- Thus, the corresponding Markov operator is trace class. 

Remark 2. Proposition 1 implies that, if r = 1/2 and X has full column 
rank, then the posterior is proper, that is, c(z) < oo. First, by construction, 
the function s(/3, z) is an invariant density for the DA Markov chain, whether 
it is integrable (in (3) or not. Now, the fact that the DA Markov operator 
is trace class implies that the DA Markov chain is geometrically ergodic, 
which in turn implies that the chain is positive recurrent. Hence, the chain 
cannot admit a nonintegrable invariant density [Meyn and Tweedie (1993), 
Chapter 10], so s(/3,z) must be integrable, that is, the posterior must be 
proper. 

We now construct a sandwich algorithm for this problem. Let G be the 
multiplicative group, Given y E Y = WV: and g S K+, define gy to be 
scalar multiplication of each element in y by g, that is, gy = (gyi,gy2, ■ ■ ■ i 
9Vm)- Clearly, ey = y and (gig2)y = gi{giy), so the compatibility conditions 
described in Section 4 are satisfied. It is also easy to see that the Lebesgue 
measure on Y is relatively invariant with multiplier x{d) = 9 m - When r = 
1/2, ir(y\z) is proportional to 

\XTDX\W ^{t^ 1 - D^X{X T DX)-'X T D^]D^z} 

m 

x I]V 1/2/ r+(^)- 

i=l 

Therefore, in this case, the density (7) takes the form 
n{gy\z)g r 



m(y) 



-ui(dg) 



0C5 (m-p-2)/2 e - 9 E™ 1 ?/ l 



exp!.-^z T D 1 / 2 [I-D 1 / 2 X(X T DX)- 1 X T D 1 / 2 ]D 1 / 2 z\ 



dg. 



So at the middle step of the three-step procedure for simulating the sandwich 
chain, we draw a g from the density above and move from y = (yi, 7/2, • • • > Vrn) 
to (gyi, gyi, . . . , gy m ), which is a random point on the ray that emanates 
from the origin and passes through the point y. If m happens to equal 
p + 1, then this density has the same form as (11), so we can draw from it 
using the inverse Gaussian distribution as described earlier. Otherwise, we 
can employ a simple rejection sampler based on inverse Gaussian and/or 
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gamma candidates. In either case, making one draw from this density is 
relatively inexpensive. 

Recall that ir(/3\y,z) is a normal density. It's easy to see that, if g 7^ 1, 
then 7r({3\gy,z) is a different normal density, which implies that (10) does 
not hold. Therefore, Theorems 1 and 2 are applicable and they imply that 
the ordered eigenvalues of the sandwich chain are all less than or equal to 
the corresponding eigenvalues of the DA chain, and at least one is strictly 
smaller. As far as we know, this sandwich algorithm has never been imple- 
mented in practice. 

APPENDIX A: PROOF OF LEMMA 1 
Fix i e N. Since f(x,y) = fx(x)fy{y) YljLoPj9j(x)hj(y), we have 

(P x hi)(x) = J^hiiy) ^P j g j (x)h j (y)jf Y (y)u(dy) = p i g i (x). 

A similar calculation shows that Pyg% = fiihi. Now, fix h G Lq(/y). Because 
{hi}'^ 1 forms an orthonormal basis for Lq(/y), we have h = 5Zi=i a i^i- Thus, 



\\PxH 



i=l 



t=i 



, X>f/f </?i||/*||, 
\ i=i 



and we have equality if h(y) = h\{y). Hence, \\Px\\ L 2 (f Y )->L 2 {f x ) = P\. An 
analogous argument shows that \\PY\\L 2 (f x )^L 2 (f Y ) = $1- Now, for each i 6 N, 
we have 

K 9i = PxPvQi = PiPxhi = pjgi- 

But {gi}^Zi form an orthonormal basis of Lq(/x), which proves that K has 
eigenvalues {/3 l 2 }^ :1 . 

APPENDIX B: PROOF OF PROPOSITION 1 

Here we show that the joint density underlying Kozumi and Kobayashi's 
(2011) DA algorithm for median regression satisfies (4). That is, we will 
show that 



n(/3\y, z)n(y\/3, z) d(3 dy < oo. 



Proof of Proposition 1. First, 

ir(y\/3, z) = c -jz exp j — 2^ \Zi - x { j3\ 
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where a = (2t 2 + 9 2 )/t 2 , y. = YliLi Vii V = YliLi Vi> an< ^ c is a constant (that 
does not involve y or /3). Now let 

W = {w G R m : Wi G {-1, 1} for i = 1, 2, . . . , m}. 

For any /? G M p and any a > 0, we have 

{m ^ f m ~1 

o |zj — xf/3| > < exp< <7^^ kil f ^2 e:K -P{ awT X P} ■ 
i=i J I i=i J tuew 



Thus, it suffices to show that, for every w G W, 

e -ay-/2 



exp<{ — io J X/3 z) dp 



dy 



is finite. We start by analyzing the inner integral. First, recall that ir(/3\y,z) 
is a multivariate normal density with mean \i = (X T DX)~ 1 X T Dz and vari- 
ance £ = {X T DX)~ l . Now, 

(2 - Xp) T D{z - Xp) = z T Dz + (P- /x) T S" 1 (/3 - ft) - fjFE- 1 ^. 

Therefore, the integrand (of the inner integral) can be rewritten as 

exp< — -(z Dz — fi L ^)>exp< w Xp 



xexpj-i^-^S- 1 ^-//) 
so the inner integral can be expressed as 

(12) exp{-±(* r Z>* - A^rV)} ^2 expj ^ t X/3}tt(/%, z) d/3, 

where 7r(/3|y,z) is a multivariate normal density with mean fi and vari- 
ance S/2. But the integral in (12) is just the moment generating function 
of P evaluated at the point J~aw T Xjr. Hence, (12) is equal to 

2-P/ 2 e W {--(z T Dz - ^ T S"V) + ^(w T Xfi) + (w T XZX T w)\ . 

{ 2 t At z J 

Now, straightforward manipulation yields 

z T Dz - /i T £- V = z T D l ' 2 (I - D l l 2 X{X T DX)~ l X T D l / 2 ) 2 D 1 l 2 z > 0. 

It follows that e -( zTDz -^ T ^ V)/2 < i. A similar calculation reveals that 
w T X'E x X T w < w T D~ l w = T 2 y.. Hence, (12) is bounded above by 

2-P/ 2 e W ^(w T X(X T DXr 1 X T Dz) + ^ J. 
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Thus, it only remains to show that, for any w € W, 

f _L expl-^f + ^(w T X(X T DX)- 1 X T Dz)\dy < oo. 

Jwp Vy I 4 T J 

We will prove this by demonstrating that w T X(X T DX)~ 1 X T Dz is uni- 
formly bounded in y. 

It follows from the general matrix result established in Appendix C that, 
for each i £ {1,2, ... , m} and all (y±, y 2 , . . . , y m ) E M™, 



X; | .L .1 ; 



+ £ 



je{l,2,...,m}j5«y 

where Ci(X) is a finite constant. Thus 



—XjxJ ) Xi < Ci(X), 
Vj ' 



\\(X I DX)- 1 X T Dz\\ 2 



i=i 



s£ 

i=l 
m 

= £ 

i=l 
m 

=£ 

i=i 

JTI 



(X T DX) 



Xi Zi 



r 2 yi 



r 2 yi 



m _ ._T\ _± 



E 



2„. 



" T Vj J 'V'i 



je{l,2,...,m},j/ 



i=l 



+ E 



je{l, 2,. ..,m},i^i 



2/j T . 

Vj 



<J2\zi\Ci(X). 



i=i 



Hence, 



\w T X(X T DXy 1 X T Dz\ = \\w T X(X T DXy 1 X T Dz\\ 2 

< \\w T X\\ 2 \\(X T DXy 1 X T Dz\\2 
is uniformly bounded in y. This completes the proof. □ 
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APPENDIX C: A MATRIX RESULT 

Fix xi, X2, ■ ■ ■ , x n € W where n and p are arbitrary positive integers. Now 
define 

sup x\ (x\x\ + al p )~ 2 xi, if n = 1, 

aeM + 

/ n X-2 

SUp xf Xixf + 2_] a i x iXi + CLiIp\ X\, if H > 2. 

V ^ / 



Lemma 3. C Pjri (xi; x 2 , • ■ • , x n ) < oo. 

Proof. We use induction on p. Note that when p = 1, we have 

x 2 [ 0, si = 0, 

Ci, n (xi;x 2 ,...,a; n )= sup 2 1 — — ^ = < J_ , 

which is finite in either case. Thus, the result is true for p=l. 
Now assume that for any n S N and any x±, . . . ,x n G M p_1 , 

Cp— l,n(^l ;x2,...,x n ) <oo. 

We will complete the argument by showing that, for any n £ N and any 
xi, R p , C Pjn (xi; ^2, . . . , x n ) < oo. The result is obviously true when 

x\ = 0. Suppose that xi 7^ 0, and let P be an orthogonal matrix such that 
Px\ = ||xi||2ei, where e± = (1,0,0, . . . ,0) T G W. For i = 2,3, . . . ,n, define 
bi = Px{. Then we have 



x\ yx\x\ + ^ajXjxf + ailpj x\ 

= xj (^P T Pxix{P T P + a i P T Px i x T i P T P + a x P T P^ x x 
= xf ^P T ^\\xi |||eief + ^2 aAbf + ail p ^j P^j x\ 
= xjp- 1 ( ||xi||| ei ef + aibibj + ai/^ (P 7 )" 1 



i=2 
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P' 1 ( Nllleiei +^2a l b l bf + a 1 I p ) (P T )~ 1 x l 



i=2 
ii 



\ x i\\2 e T\ \\xi\\leie{ + y ^ i a i bibf + ail p \ &\. 



i=2 



Now let A = {i G {2, . . . , n} : bj ei = 0} and let B = {2,...,n} \ A, that is, 
B = {i e {2, . . . ,n} : bje\ / 0}. If i € A, then there exists a Vi G M^ 1 such 
that 



and, if i £ -B, then there exists a nonzero real number m and t>j £ M p 1 such 
that 



Thus, we have 



n \ 

+ ^ajXjxf + ail p xi 

i=2 / 

i=2 



kills*! 



Ikilll o r 




J 

Vivf 



T 



1 ^ 



+ atl p ei 



l^llllef 



It tr 



ei, 



where ti := + ^2 ieB aiuf + ai, « := Yli^B a i u l v i and 

ieA ieB 

If B is empty, then v is taken to be the zero vector in The formula 

for the inverse of a partitioned matrix yields 

-i 



u v 
v W 



1 



u — V 



T W~h 
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1 -v T W- 1 
-W~ x v (u - v T W~ 1 v)W~ 1 + W~ 1 vv T W~' 1 



It follows that 



u v 
v W 



ei 



l + v T W~ 2 v 
(u-v T W- l v) 2 ' 



If n = 1 or B is empty, then 

Cp 7n (xi;x2,...,x n ) = \\xi\\ 2 , sup 



(IM! + ai) 2 IMI 



< oo, 



so the result holds. In the remainder of the proof, we assume that n > 2 
and B is not empty. 
Note that the matrix 



u-||xi||| v T 
v W 



^aibibf + axlp 



i=2 



is positive definite, which implies that it's determinant is strictly positive, 
that is, 

\W\(u-\\x l \\l-v T W- l v)>0. 
Since W is also positive definite, u — v T W~ 1 v > ||ari |||- Moreover, 



v T W- 2 v = \\W- 1 v\\ 2 2 
Therefore, 



aiU 2 Vi 



< 



ieB 



l + v T W -2 v < l + \£ ieB \\W-Ha i u 2 v i )\\2} 



{u-v T W~ l v) 2 
Putting all of this together yields 



+ diXixf + a\I p x\ < 



i=2 



^ + [^BW- 1 {a l u 2 v i )U 
INII 



Recall that A U B = {2, 3, . . . , n}. For fixed i £ B, let ki i, ki%, . . . , k{ tn -2 de- 
note the n — 2 elements of the set {2, 3, . . . , n} \ {i}. Then we have 

\\W- l (a lU 2 v t )\\ 2 



v i I v i v i + ]> -^2 V i v j + 2 V J V j H 2 J P-1 



< Cp-i^-i^; v ki l , v ki 2 ,. . . , v k ). 
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Thus, using the induction hypothesis, we have 

Cp ) n(x 1 ;x 2 ,-..,x n )= sup xi I x\x\ + ajXjxf + a\l v 1 xi 



aen 



i=2 



1 + EieB V Cp-i,n-i(vi;v k , u fc , . . . , ufe )]' 

^ " \T1T2 

F1II2 

which is finite. This completes the proof of the lemma. □ 



Remark 3. Note that if xixf + Y17=2 a i x i x f ^ s invertible for every 
(s 2 ,...,a n )el"" 1 , then 

C Pt n(xi;x 2 ,...,x n )= sup x\ 1 x\x\ + ajXjxJ 1 xi. 

(a 2 ,...,o„)eiR™- 1 V i=2 / 
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